Conversation
Fix batch waitpoint lock contention

When processing `batchTriggerAndWait` items, each batch item was acquiring a Redis lock on the parent run to insert a `TaskRunWaitpoint` row. With high concurrency (`processingConcurrency=50`), this caused `LockAcquisitionTimeoutError` (880 errors/24h in prod), orphaned runs, and stuck parent runs. Since `blockRunWithCreatedBatch` already transitions the parent to `EXECUTING_WITH_WAITPOINTS` before items are processed, the per-item lock is unnecessary. The new `blockRunWithWaitpointLockless` method performs only the idempotent CTE insert and timeout scheduling, without acquiring the lock.

refs TRI-7837
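To illustrate why the lock can be dropped, here is a minimal, self-contained model of the idempotent insert. The real method issues a single SQL CTE with `ON CONFLICT DO NOTHING` against Postgres; the class below is only a stand-in for that behavior, and modelling the conflict target as a `(runId, waitpointId)` uniqueness key is an assumption for illustration.

```typescript
// Stand-in for the TaskRunWaitpoint table. An insert either creates the
// row or silently does nothing on conflict — a pure conditional write
// with no read-modify-write race, so no run lock is required.
class TaskRunWaitpointTable {
  private rows = new Set<string>();

  // Returns true if a row was inserted, false if it already existed
  // (the ON CONFLICT DO NOTHING case). Never throws on duplicates.
  insertOnConflictDoNothing(runId: string, waitpointId: string): boolean {
    const key = `${runId}:${waitpointId}`;
    if (this.rows.has(key)) return false;
    this.rows.add(key);
    return true;
  }

  count(): number {
    return this.rows.size;
  }
}

const table = new TaskRunWaitpointTable();

// A batch item blocking the parent run, a concurrent retry of the same
// item, and a different item: the retry is a harmless no-op.
const first = table.insertOnConflictDoNothing("parent-run", "waitpoint-1");
const retry = table.insertOnConflictDoNothing("parent-run", "waitpoint-1");
const other = table.insertOnConflictDoNothing("parent-run", "waitpoint-2");
```

Because each write is a self-contained conditional insert, the Redis lock was only ever needed when the caller also had to transition the execution snapshot — which `blockRunWithCreatedBatch` now does once, up front, for the whole batch.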
Walkthrough

Adds a lockless waitpoint insertion path to reduce Redis lock contention for `batchTriggerAndWait` processing. A new method, `blockRunWithWaitpointLockless`, was introduced in `WaitpointSystem` to perform idempotent CTE inserts (`ON CONFLICT DO NOTHING`) of `TaskRunWaitpoint` and `_WaitpointRunConnections` rows and to enqueue timeout jobs when provided; it intentionally avoids acquiring the parent run lock. The engine trigger flow was changed to call the lockless path for batch items and retain the locked path for non-batch items. The new method was duplicated in the file. The `RunLocker` test timeout was increased.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🧹 Nitpick comments (3)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (2)
540-559: Guard the lockless path with an explicit state-invariant check.

This method relies on a strict precondition (the parent is already in `EXECUTING_WITH_WAITPOINTS`). Add a fast-fail guard so future call-site drift doesn't silently create blockers without the corresponding snapshot transition.

Suggested hardening:

```diff
 async blockRunWithWaitpointLockless({ @@ }): Promise<void> {
   const prisma = tx ?? this.$.prisma;
   const $waitpoints = typeof waitpoints === "string" ? [waitpoints] : waitpoints;
+  const snapshot = await getLatestExecutionSnapshot(prisma, runId);
+  if (snapshot.executionStatus !== "EXECUTING_WITH_WAITPOINTS") {
+    throw new Error(
+      `blockRunWithWaitpointLockless requires EXECUTING_WITH_WAITPOINTS; got ${snapshot.executionStatus}`
+    );
+  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` around lines 540 - 559, Add a fast-fail state invariant check at the start of blockRunWithWaitpointLockless: fetch the parent/run record for runId (using the same prisma client used in the function) and verify its state equals the expected EXECUTING_WITH_WAITPOINTS constant; if not, throw or return an explicit error indicating the precondition failed. Place this guard immediately after resolving prisma and $waitpoints so the function fails fast on incorrect call-site state drift rather than proceeding to create blockers. Use the function name blockRunWithWaitpointLockless, the runId param, and the EXECUTING_WITH_WAITPOINTS state constant to locate where to add the check.
528-605: Add observability instrumentation and crumbs to this new system method.

Please wrap this path with tracer/meter instrumentation (low-cardinality attributes only) and add `@crumbs` markers around the insert and timeout-scheduling branches.

As per coding guidelines:

- `internal-packages/run-engine/src/engine/systems/**/*.ts`: Integrate OpenTelemetry tracer and meter instrumentation in RunEngine systems for observability.
- `**/*.{ts,tsx,js}`: Add crumbs as you write code using `// @crumbs` comments or `// #region @crumbs` blocks for agentcrumbs debug tracing.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal-packages/run-engine/src/engine/systems/waitpointSystem.ts` around lines 528-605: Wrap the body of blockRunWithWaitpointLockless in an OpenTelemetry span (use this.$.tracer.startActiveSpan or equivalent) and record low-cardinality attributes (runId, projectId, batchId, batchIndex, waitpointsCount, spanIdToComplete, and a boolean batch flag); use a meter counter to record successful insert attempts and timeout-job scheduling counts. Place agent-crumbs markers around the CTE insert block and around the timeout scheduling loop (e.g. `// @crumbs begin insert`, `// @crumbs end insert`, and `// @crumbs begin schedule-timeouts`, `// @crumbs end schedule-timeouts`). Ensure exceptions inside the span are recorded and rethrown, and keep waitpoint IDs out of span attributes to avoid high-cardinality telemetry.

internal-packages/run-engine/src/engine/index.ts (1)
731-756: Add `@crumbs` tracing to the new batch/non-batch waitpoint branch.

Please add crumbs on this new execution fork so lockless-vs-locked routing is visible in agent debugging traces.

As per coding guidelines:

- `**/*.{ts,tsx,js}`: Add crumbs as you write code using `// @crumbs` comments or `// #region @crumbs` blocks for agentcrumbs debug tracing.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@internal-packages/run-engine/src/engine/index.ts` around lines 731-756: The new branching between lockless and locked waitpoint paths (inside the surrounding method that calls this.waitpointSystem) lacks `@crumbs` tracing; add `// @crumbs` comments before the lockless call (blockRunWithWaitpointLockless) and before the locked call (blockRunWithWaitpoint) so agent traces show which path was taken. Each crumb should be a single-line comment with a clear message (e.g. "waitpoint:lockless" vs "waitpoint:locked") and include key context such as runId (parentTaskRunId), waitpoint id (taskRun.associatedWaitpoint.id), and the batch flag so debugging traces can correlate runs. Place the crumbs immediately above the respective await calls in the same function so the agentcrumbs tooling picks them up. Ensure you follow the project's crumb format used elsewhere (`// @crumbs ...`) for consistency.
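A runnable sketch of what this requested branch tracing could look like. The method names and argument shape here follow the review text (`blockRunWithWaitpointLockless`, `runId`, `waitpoints`), but the call-site structure and crumb messages are assumptions for illustration, not code from the repository; the stub `waitpointSystem` exists only so the branch executes in isolation.

```typescript
// Minimal stand-in for this.waitpointSystem so the fork is runnable here.
type BlockArgs = { runId: string; waitpoints: string };

const calls: string[] = [];
const waitpointSystem = {
  async blockRunWithWaitpoint(args: BlockArgs): Promise<void> {
    calls.push(`locked:${args.runId}`);
  },
  async blockRunWithWaitpointLockless(args: BlockArgs): Promise<void> {
    calls.push(`lockless:${args.runId}`);
  },
};

async function blockParent(
  parentTaskRunId: string,
  waitpointId: string,
  isBatchItem: boolean
): Promise<void> {
  if (isBatchItem) {
    // @crumbs waitpoint:lockless — parent already EXECUTING_WITH_WAITPOINTS, skip the run lock
    await waitpointSystem.blockRunWithWaitpointLockless({
      runId: parentTaskRunId,
      waitpoints: waitpointId,
    });
  } else {
    // @crumbs waitpoint:locked — non-batch path still acquires the parent run lock
    await waitpointSystem.blockRunWithWaitpoint({
      runId: parentTaskRunId,
      waitpoints: waitpointId,
    });
  }
}

await blockParent("run_1", "wp_1", true);
await blockParent("run_2", "wp_2", false);
```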
📒 Files selected for processing (3)

- .server-changes/fix-batch-waitpoint-lock-contention.md
- internal-packages/run-engine/src/engine/index.ts
- internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
🔇 Additional comments (1)
.server-changes/fix-batch-waitpoint-lock-contention.md (1)
1-6: Clear and actionable release note.

Good summary of the production symptom, root cause, and behavioral change.
## Summary 12 new features, 59 improvements, 17 bug fixes. ## Highlights - Add support for setting TTL (time-to-live) defaults at the task level and globally in trigger.config.ts, with per-trigger overrides still taking precedence ([#3196](#3196)) - Large run outputs can use the new API which allows switching object storage providers. ([#3275](#3275)) ## Improvements - Add platform notifications support to the CLI. The `trigger dev` and `trigger login` commands now fetch and display platform notifications (info, warn, error, success) from the server. Includes discovery-based filtering to conditionally show notifications based on project file patterns, color markup rendering for styled terminal output, and a non-blocking display flow with a spinner fallback for slow fetches. Use `--skip-platform-notifications` flag with `trigger dev` to disable the notification check. ([#3254](#3254)) - Add `get_span_details` MCP tool for inspecting individual spans within a run trace. ([#3255](#3255)) - New `get_span_details` tool returns full span attributes, timing, events, and AI enrichment (model, tokens, cost, speed) - Span IDs now shown in `get_run_details` trace output for easy discovery - New API endpoint `GET /api/v1/runs/:runId/spans/:spanId` - New `retrieveSpan()` method on the API client - `get_query_schema` — discover available TRQL tables and columns - `query` — execute TRQL queries against your data - `list_dashboards` — list built-in dashboards and their widgets - `run_dashboard_query` — execute a single dashboard widget query - `whoami` — show current profile, user, and API URL - `list_profiles` — list all configured CLI profiles - `switch_profile` — switch active profile for the MCP session - `start_dev_server` — start `trigger dev` in the background and stream output - `stop_dev_server` — stop the running dev server - `dev_server_status` — check dev server status and view recent logs - `GET /api/v1/query/schema` — query table schema discovery - `GET 
/api/v1/query/dashboards` — list built-in dashboards - `--readonly` flag hides write tools (`deploy`, `trigger_task`, `cancel_run`) so the AI cannot make changes - `read:query` JWT scope for query endpoint authorization - `get_run_details` trace output is now paginated with cursor support - MCP tool annotations (`readOnlyHint`, `destructiveHint`) for all tools - `get_query_schema` now requires a table name and returns only one table's schema (was returning all tables) - `get_current_worker` no longer inlines payload schemas; use new `get_task_schema` tool instead - Query results formatted as text tables instead of JSON (~50% fewer tokens) - `cancel_run`, `list_deploys`, `list_preview_branches` formatted as text instead of raw JSON - Schema and dashboard API responses cached to avoid redundant fetches - Adapted the CLI API client to propagate the trigger source via http headers. ([#3241](#3241)) - Propagate run tags to span attributes so they can be extracted server-side for LLM cost attribution metadata. 
([#3213](#3213)) - New `get_span_details` tool returns full span attributes, timing, events, and AI enrichment (model, tokens, cost, speed) - Span IDs now shown in `get_run_details` trace output for easy discovery - New API endpoint `GET /api/v1/runs/:runId/spans/:spanId` - New `retrieveSpan()` method on the API client - `get_query_schema` — discover available TRQL tables and columns - `query` — execute TRQL queries against your data - `list_dashboards` — list built-in dashboards and their widgets - `run_dashboard_query` — execute a single dashboard widget query - `whoami` — show current profile, user, and API URL - `list_profiles` — list all configured CLI profiles - `switch_profile` — switch active profile for the MCP session - `start_dev_server` — start `trigger dev` in the background and stream output - `stop_dev_server` — stop the running dev server - `dev_server_status` — check dev server status and view recent logs - `GET /api/v1/query/schema` — query table schema discovery - `GET /api/v1/query/dashboards` — list built-in dashboards - `--readonly` flag hides write tools (`deploy`, `trigger_task`, `cancel_run`) so the AI cannot make changes - `read:query` JWT scope for query endpoint authorization - `get_run_details` trace output is now paginated with cursor support - MCP tool annotations (`readOnlyHint`, `destructiveHint`) for all tools - `get_query_schema` now requires a table name and returns only one table's schema (was returning all tables) - `get_current_worker` no longer inlines payload schemas; use new `get_task_schema` tool instead - Query results formatted as text tables instead of JSON (~50% fewer tokens) - `cancel_run`, `list_deploys`, `list_preview_branches` formatted as text instead of raw JSON - Schema and dashboard API responses cached to avoid redundant fetches - Add optional `hasPrivateLink` field to the dequeue message organization object for private networking support ([#3264](#3264)) - Define and manage AI prompts with `prompts.define()`. 
Create typesafe prompt templates with variables, resolve them at runtime, and manage versions and overrides from the dashboard without redeploying. ([#3244](#3244)) ## Bug fixes - Fix dev CLI leaking build directories on rebuild, causing disk space accumulation. Deprecated workers are now pruned (capped at 2 retained) when no active runs reference them. The watchdog process also cleans up `.trigger/tmp/` when the dev CLI is killed ungracefully (e.g. SIGKILL from pnpm). ([#3224](#3224)) - Fix `--load` flag being silently ignored on local/self-hosted builds. ([#3114](#3114)) - Fixed `search_docs` tool failing due to renamed upstream Mintlify tool (`SearchTriggerDev` → `search_trigger_dev`) - Fixed `list_deploys` failing when deployments have null `runtime`/`runtimeVersion` fields (#3139) - Fixed `list_preview_branches` crashing due to incorrect response shape access - Fixed `metrics` table column documented as `value` instead of `metric_value` in query docs - Fixed dev CLI leaking build directories on rebuild — deprecated workers now clean up their build dirs when their last run completes - Fixed `search_docs` tool failing due to renamed upstream Mintlify tool (`SearchTriggerDev` → `search_trigger_dev`) - Fixed `list_deploys` failing when deployments have null `runtime`/`runtimeVersion` fields (#3139) - Fixed `list_preview_branches` crashing due to incorrect response shape access - Fixed `metrics` table column documented as `value` instead of `metric_value` in query docs - Fixed dev CLI leaking build directories on rebuild — deprecated workers now clean up their build dirs when their last run completes ## Server changes These changes affect the self-hosted Docker image and Trigger.dev Cloud: - Add admin UI for viewing and editing feature flags (org-level overrides and global defaults). ([#3291](#3291)) - AI prompt management dashboard and enhanced span inspectors. 
**Prompt management:** - Prompts list page with version status, model, override indicators, and 24h usage sparklines - Prompt detail page with template viewer, variable preview, version history timeline, and override editor - Create, edit, and remove overrides to change prompt content or model without redeploying - Promote any code-deployed version to current - Generations tab with infinite scroll, live polling, and inline span inspector - Per-prompt metrics: total generations, avg tokens, avg cost, latency, with version-level breakdowns **AI span inspectors:** - Custom inspectors for `ai.generateText`, `ai.streamText`, `ai.generateObject`, `ai.streamObject` parent spans - `ai.toolCall` inspector showing tool name, call ID, and input arguments - `ai.embed` inspector showing model, provider, and input text - Prompt tab on AI spans linking to prompt version with template and input variables - Compact timestamp and duration header on all AI span inspectors **AI metrics dashboard:** - Operations, Providers, and Prompts filters on the AI Metrics dashboard - Cost by prompt widget - "AI" section in the sidebar with Prompts and AI Metrics links **Other improvements:** - Resizable panel sizes now persist across page refreshes - Fixed `<div>` inside `<p>` DOM nesting warnings in span titles and chat messages ([#3244](#3244)) - Add allowRollbacks query param to the promote deployment API to enable version downgrades ([#3214](#3214)) - Pre-warm compute templates on deploy for orgs with compute access. Required for projects using a compute region, background-only for others. ([#3114](#3114)) - Add automatic LLM cost calculation for spans with GenAI semantic conventions. 
When a span arrives with `gen_ai.response.model` and token usage data, costs are calculated from an in-memory pricing registry backed by Postgres and dual-written to both span attributes (`trigger.llm.*`) and a new `llm_metrics_v1` ClickHouse table that captures usage, cost, performance (TTFC, tokens/sec), and behavioral (finish reason, operation type) metrics. ([#3213](#3213)) - Add API endpoint `GET /api/v1/runs/:runId/spans/:spanId` that returns detailed span information including properties, events, AI enrichment (model, tokens, cost), and triggered child runs. ([#3255](#3255)) - Multi-provider object storage with protocol-based routing for zero-downtime migration ([#3275](#3275)) - Add IAM role-based auth support for object stores (no access keys required). ([#3275](#3275)) - Add platform notifications to inform users about new features, changelogs, and platform events directly in the dashboard. ([#3254](#3254)) - Add private networking support via AWS PrivateLink. Includes BillingClient methods for managing private connections, org settings UI pages for connection management, and supervisor changes to apply `privatelink` pod labels for CiliumNetworkPolicy matching. ([#3264](#3264)) - Reduce run start latency by skipping the intermediate queue when concurrency is available. This optimization is rolled out per-region and enabled automatically for development environments. ([#3299](#3299)) - Extended the search filter on the environment variables page to match on environment type (production, staging, development, preview) and branch name, not just variable name and value. ([#3302](#3302)) - Set `application_name` on Prisma connections from SERVICE_NAME so DB load can be attributed by service ([#3348](#3348)) - Fix transient R2/object store upload failures during batchTrigger() item streaming. 
- Added p-retry (3 attempts, 500ms–2s exponential backoff) around `uploadPacketToObjectStore` in `BatchPayloadProcessor.process()` so transient network errors self-heal server-side rather than aborting the entire batch stream. - Removed `x-should-retry: false` from the 500 response on the batch items route so the SDK's existing 5xx retry path can recover if server-side retries are exhausted. Item deduplication by index makes full-stream retries safe. ([#3331](#3331)) - Concurrency-keyed queues now use a single master queue entry per base queue instead of one entry per key. Prevents high-CK-count tenants from consuming the entire parentQueueLimit window and starving other tenants on the same shard. ([#3219](#3219)) - Reduce lock contention when processing large `batchTriggerAndWait` batches. Previously, each batch item acquired a Redis lock on the parent run to insert a `TaskRunWaitpoint` row, causing `LockAcquisitionTimeoutError` with high concurrency (880 errors/24h in prod). Since `blockRunWithCreatedBatch` already transitions the parent to `EXECUTING_WITH_WAITPOINTS` before items are processed, the per-item lock is unnecessary. The new `blockRunWithWaitpointLockless` method performs only the idempotent CTE insert without acquiring the lock. ([#3232](#3232)) - Strip `secure` query parameter from QUERY_CLICKHOUSE_URL before passing to ClickHouse client. This was already done for the main and logs ClickHouse clients but was missing for the query client, causing a startup crash with `Error: Unknown URL parameters: secure`. ([#3204](#3204)) - Fix `OrganizationsPresenter.#getEnvironment` matching the wrong development environment on teams with multiple members. All dev environments share the slug `"dev"`, so the previous `find` by slug alone could return another member's environment. Now filters DEVELOPMENT environments by `orgMember.userId` to ensure the logged-in user's dev environment is selected. 
([#3273](#3273))

<details>
<summary>Raw changeset output</summary>

# Releases

## @trigger.dev/build@4.4.4

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.4`

## trigger.dev@4.4.4

### Patch Changes

- Add platform notifications support to the CLI. The `trigger dev` and `trigger login` commands now fetch and display platform notifications (info, warn, error, success) from the server. Includes discovery-based filtering to conditionally show notifications based on project file patterns, color markup rendering for styled terminal output, and a non-blocking display flow with a spinner fallback for slow fetches. Use `--skip-platform-notifications` flag with `trigger dev` to disable the notification check. ([#3254](#3254))
- Fix dev CLI leaking build directories on rebuild, causing disk space accumulation. Deprecated workers are now pruned (capped at 2 retained) when no active runs reference them. The watchdog process also cleans up `.trigger/tmp/` when the dev CLI is killed ungracefully (e.g. SIGKILL from pnpm). ([#3224](#3224))
- Fix `--load` flag being silently ignored on local/self-hosted builds. ([#3114](#3114))
- Add `get_span_details` MCP tool for inspecting individual spans within a run trace. ([#3255](#3255))
  - New `get_span_details` tool returns full span attributes, timing, events, and AI enrichment (model, tokens, cost, speed)
  - Span IDs now shown in `get_run_details` trace output for easy discovery
  - New API endpoint `GET /api/v1/runs/:runId/spans/:spanId`
  - New `retrieveSpan()` method on the API client
- MCP server improvements: new tools, bug fixes, and new flags. ([#3224](#3224))

  **New tools:**
  - `get_query_schema` — discover available TRQL tables and columns
  - `query` — execute TRQL queries against your data
  - `list_dashboards` — list built-in dashboards and their widgets
  - `run_dashboard_query` — execute a single dashboard widget query
  - `whoami` — show current profile, user, and API URL
  - `list_profiles` — list all configured CLI profiles
  - `switch_profile` — switch active profile for the MCP session
  - `start_dev_server` — start `trigger dev` in the background and stream output
  - `stop_dev_server` — stop the running dev server
  - `dev_server_status` — check dev server status and view recent logs

  **New API endpoints:**
  - `GET /api/v1/query/schema` — query table schema discovery
  - `GET /api/v1/query/dashboards` — list built-in dashboards

  **New features:**
  - `--readonly` flag hides write tools (`deploy`, `trigger_task`, `cancel_run`) so the AI cannot make changes
  - `read:query` JWT scope for query endpoint authorization
  - `get_run_details` trace output is now paginated with cursor support
  - MCP tool annotations (`readOnlyHint`, `destructiveHint`) for all tools

  **Bug fixes:**
  - Fixed `search_docs` tool failing due to renamed upstream Mintlify tool (`SearchTriggerDev` → `search_trigger_dev`)
  - Fixed `list_deploys` failing when deployments have null `runtime`/`runtimeVersion` fields (#3139)
  - Fixed `list_preview_branches` crashing due to incorrect response shape access
  - Fixed `metrics` table column documented as `value` instead of `metric_value` in query docs
  - Fixed dev CLI leaking build directories on rebuild — deprecated workers now clean up their build dirs when their last run completes

  **Context optimizations:**
  - `get_query_schema` now requires a table name and returns only one table's schema (was returning all tables)
  - `get_current_worker` no longer inlines payload schemas; use new `get_task_schema` tool instead
  - Query results formatted as text tables instead of JSON (~50% fewer tokens)
  - `cancel_run`, `list_deploys`, `list_preview_branches` formatted as text instead of raw JSON
  - Schema and dashboard API responses cached to avoid redundant fetches
- Add support for setting TTL (time-to-live) defaults at the task level and globally in trigger.config.ts, with per-trigger overrides still taking precedence ([#3196](#3196))
- Adapted the CLI API client to propagate the trigger source via http headers. ([#3241](#3241))
- Updated dependencies:
  - `@trigger.dev/core@4.4.4`
  - `@trigger.dev/build@4.4.4`
  - `@trigger.dev/schema-to-json@4.4.4`

## @trigger.dev/core@4.4.4

### Patch Changes

- Fix `list_deploys` MCP tool failing when deployments have null `runtime` or `runtimeVersion` fields. ([#3224](#3224))
- Propagate run tags to span attributes so they can be extracted server-side for LLM cost attribution metadata. ([#3213](#3213))
- Add `get_span_details` MCP tool for inspecting individual spans within a run trace. ([#3255](#3255))
  - New `get_span_details` tool returns full span attributes, timing, events, and AI enrichment (model, tokens, cost, speed)
  - Span IDs now shown in `get_run_details` trace output for easy discovery
  - New API endpoint `GET /api/v1/runs/:runId/spans/:spanId`
  - New `retrieveSpan()` method on the API client
- MCP server improvements: new tools, bug fixes, and new flags. ([#3224](#3224))

  **New tools:**
  - `get_query_schema` — discover available TRQL tables and columns
  - `query` — execute TRQL queries against your data
  - `list_dashboards` — list built-in dashboards and their widgets
  - `run_dashboard_query` — execute a single dashboard widget query
  - `whoami` — show current profile, user, and API URL
  - `list_profiles` — list all configured CLI profiles
  - `switch_profile` — switch active profile for the MCP session
  - `start_dev_server` — start `trigger dev` in the background and stream output
  - `stop_dev_server` — stop the running dev server
  - `dev_server_status` — check dev server status and view recent logs

  **New API endpoints:**
  - `GET /api/v1/query/schema` — query table schema discovery
  - `GET /api/v1/query/dashboards` — list built-in dashboards

  **New features:**
  - `--readonly` flag hides write tools (`deploy`, `trigger_task`, `cancel_run`) so the AI cannot make changes
  - `read:query` JWT scope for query endpoint authorization
  - `get_run_details` trace output is now paginated with cursor support
  - MCP tool annotations (`readOnlyHint`, `destructiveHint`) for all tools

  **Bug fixes:**
  - Fixed `search_docs` tool failing due to renamed upstream Mintlify tool (`SearchTriggerDev` → `search_trigger_dev`)
  - Fixed `list_deploys` failing when deployments have null `runtime`/`runtimeVersion` fields (#3139)
  - Fixed `list_preview_branches` crashing due to incorrect response shape access
  - Fixed `metrics` table column documented as `value` instead of `metric_value` in query docs
  - Fixed dev CLI leaking build directories on rebuild — deprecated workers now clean up their build dirs when their last run completes

  **Context optimizations:**
  - `get_query_schema` now requires a table name and returns only one table's schema (was returning all tables)
  - `get_current_worker` no longer inlines payload schemas; use new `get_task_schema` tool instead
  - Query results formatted as text tables instead of JSON (~50% fewer tokens)
  - `cancel_run`, `list_deploys`, `list_preview_branches` formatted as text instead of raw JSON
  - Schema and dashboard API responses cached to avoid redundant fetches
- Large run outputs can use the new API which allows switching object storage providers. ([#3275](#3275))
- Add optional `hasPrivateLink` field to the dequeue message organization object for private networking support ([#3264](#3264))
- Add support for setting TTL (time-to-live) defaults at the task level and globally in trigger.config.ts, with per-trigger overrides still taking precedence ([#3196](#3196))
- Adapted the CLI API client to propagate the trigger source via http headers. ([#3241](#3241))

## @trigger.dev/python@4.4.4

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/sdk@4.4.4`
  - `@trigger.dev/core@4.4.4`
  - `@trigger.dev/build@4.4.4`

## @trigger.dev/react-hooks@4.4.4

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.4`

## @trigger.dev/redis-worker@4.4.4

### Patch Changes

- Adapted the CLI API client to propagate the trigger source via http headers. ([#3241](#3241))
- Updated dependencies:
  - `@trigger.dev/core@4.4.4`

## @trigger.dev/rsc@4.4.4

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.4`

## @trigger.dev/schema-to-json@4.4.4

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.4.4`

## @trigger.dev/sdk@4.4.4

### Patch Changes

- Define and manage AI prompts with `prompts.define()`. Create typesafe prompt templates with variables, resolve them at runtime, and manage versions and overrides from the dashboard without redeploying. ([#3244](#3244))
- Add support for setting TTL (time-to-live) defaults at the task level and globally in trigger.config.ts, with per-trigger overrides still taking precedence ([#3196](#3196))
- Adapted the CLI API client to propagate the trigger source via http headers. ([#3241](#3241))
- Updated dependencies:
  - `@trigger.dev/core@4.4.4`

</details>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
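The batchTrigger() upload fix above wraps `uploadPacketToObjectStore` in p-retry with 3 attempts and 500ms–2s exponential backoff. As a rough sketch of that schedule (not the actual implementation — the server uses the p-retry package; the function and parameter names here are assumptions), the delay per attempt doubles from a 500ms floor up to a 2s cap:

```typescript
// Hypothetical sketch of the retry schedule: minTimeout 500ms, factor 2,
// maxTimeout 2000ms -- the shape p-retry uses for exponential backoff.
function backoffDelayMs(
  attempt: number, // 1-based failed-attempt counter
  minTimeout = 500,
  factor = 2,
  maxTimeout = 2000
): number {
  return Math.min(minTimeout * factor ** (attempt - 1), maxTimeout);
}

// Minimal retry loop around an async operation (stand-in for p-retry).
async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // transient failure: back off, then try again
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError; // retries exhausted; the SDK's 5xx retry path takes over
}

console.log([1, 2, 3].map((a) => backoffDelayMs(a))); // [ 500, 1000, 2000 ]
```

If all server-side attempts fail, the now-retryable 500 response (no `x-should-retry: false`) lets the SDK retry the whole stream, which item deduplication by index makes safe.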
When processing batchTriggerAndWait items, each batch item was acquiring a
Redis lock on the parent run to insert a TaskRunWaitpoint row. With high
concurrency (processingConcurrency=50), this caused LockAcquisitionTimeoutError
(880 errors/24h in prod), orphaned runs, and stuck parent runs.
Since blockRunWithCreatedBatch already transitions the parent to
EXECUTING_WITH_WAITPOINTS before items are processed, the per-item lock is
unnecessary. The new blockRunWithWaitpointLockless method performs only the
idempotent CTE insert and timeout scheduling without acquiring the lock.
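Why dropping the lock is safe rests on the insert being idempotent: the real method runs a raw CTE insert with `ON CONFLICT DO NOTHING` in Postgres, keyed on the unique (taskRunId, waitpointId) pair, so concurrent batch items racing on the same parent run cannot produce duplicate rows. A toy in-memory model of that semantics (names and shape are illustrative, not the actual Prisma code):

```typescript
// Toy model of ON CONFLICT DO NOTHING: the composite key, not a Redis lock,
// guarantees that concurrent or repeated inserts of the same waitpoint are no-ops.
type TaskRunWaitpoint = { taskRunId: string; waitpointId: string };

class WaitpointStore {
  private rows = new Map<string, TaskRunWaitpoint>();

  // Returns true if a row was inserted, false if it already existed.
  insertIgnoringConflict(row: TaskRunWaitpoint): boolean {
    const key = `${row.taskRunId}:${row.waitpointId}`;
    if (this.rows.has(key)) return false; // conflict: do nothing
    this.rows.set(key, row);
    return true;
  }

  count(): number {
    return this.rows.size;
  }
}

const store = new WaitpointStore();
const first = store.insertIgnoringConflict({ taskRunId: "run_1", waitpointId: "wp_1" });
const second = store.insertIgnoringConflict({ taskRunId: "run_1", waitpointId: "wp_1" });
console.log(first, second, store.count()); // true false 1
```

Because the parent is already in EXECUTING_WITH_WAITPOINTS before any item is processed, no state transition happens per item, and the conflict-tolerant insert is all the lock was protecting.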